Endogenous retroviruses mediate transcriptional rewiring in response to oncogenic signaling in colorectal cancer

Cancer cells exhibit rewired transcriptional regulatory networks that promote tumor growth and survival. However, the mechanisms underlying the formation of these pathological networks remain poorly understood. Through a pan-cancer epigenomic analysis, we found that primate-specific endogenous retroviruses (ERVs) are a rich source of enhancers displaying cancer-specific activity. In colorectal cancer and other epithelial tumors, oncogenic MAPK/AP1 signaling drives the activation of enhancers derived from the primate-specific ERV family LTR10. Functional studies in colorectal cancer cells revealed that LTR10 elements regulate tumor-specific expression of multiple genes associated with tumorigenesis, such as ATG12 and XRCC4. Within the human population, individual LTR10 elements exhibit germline and somatic structural variation resulting from a highly mutable internal tandem repeat region, which affects AP1 binding activity. Our findings reveal that ERV-derived enhancers contribute to transcriptional dysregulation in response to oncogenic signaling and shape the evolution of cancer-specific regulatory networks.


INTRODUCTION
Cancer cells undergo global transcriptional changes resulting from genetic and epigenetic alterations during tumorigenesis (You and Jones 2012). While regulatory remodeling can arise from somatic non-coding mutations (Rheinbay et al. 2020), epigenomic studies have revealed that transformation is associated with aberrant epigenetic activation of enhancer sequences that are typically silenced in normal tissues (Cohen et al. 2017;Chapuy et al. 2013; Roe et al. 2017). Pathological enhancer activity is an established mechanism underlying tumorigenesis and therapy resistance, and therapeutic modulation of enhancer activity is an active area of investigation (Hnisz et al. 2015;Bradner, Hnisz, and Young 2017;Sur and Taipale 2016;Flavahan, Gaskell, and Bernstein 2017). However, we have a limited understanding of the molecular processes that shape and establish the enhancer landscapes of cancer cells.
Transposable elements (TEs) including endogenous retroviruses (ERVs) represent a potentially significant source of enhancers that could shape cancer-specific gene regulation (Jansz and Faulkner 2021). Many cancers exhibit genome-wide transcriptional reactivation of TEs, which can directly impact cells by promoting oncogenic mutations and stimulating immune signaling (Kong et al. 2019;Shukla et al. 2013;Rodriguez-Martin et al. 2020; Rodić et al. 2014). In addition, the reactivation of TEs is increasingly recognized to have gene regulatory consequences in cancer cells (Artem Babaian and Mager 2016).
Several transcriptomic studies have uncovered TEs as a source of cancer-specific alternative promoters across many types of cancer, with some examples shown to drive oncogene expression (A. Lamprecht et al. 2010;Edginton-White et al. 2019;Jang et al. 2019). TEs also show chromatin signatures of enhancer activity in cancer cell lines (Wang et al. 2007;Sundaram et al. 2014;Jacques, Jeyakani, and Bourque 2013), yet their functional relevance in patient tumors has remained largely unexplored. A recent study identified and characterized ERV-derived enhancers with oncogenic effects in acute myeloid leukemia (Deniz et al. 2020), but the prevalence and mechanisms of TE-derived enhancer activity are unknown for most cancer types.
Here, we leveraged the availability of published cancer epigenome datasets to understand how TEs influence enhancer landscapes and gene regulation across cancer types. Our pan-cancer analysis revealed that elements from a primate-specific ERV named LTR10 show enhancer activity in many epithelial tumors, and this activity is regulated by MAPK/AP1 signaling. We conducted functional studies in HCT116 colorectal cancer cells, and found that LTR10 elements regulate AP1-dependent gene expression at multiple loci that include genes with established roles in tumorigenesis. Finally, we discovered that LTR10 elements contain highly mutable sequences that potentially contribute genomic variation affecting cancer-specific gene expression. Our work implicates ERVs as a source of pathological regulatory variants that facilitate transcriptional rewiring in cancer.

RESULTS
To assess the contribution of TEs to cancer cell epigenomes, we analyzed aggregate chromatin accessibility maps from 21 human cancers generated by The Cancer Genome Atlas project (Weinstein et al. 2013). We defined cancer-specific subsets of accessible regions by removing regions that show evidence of regulatory activity in any healthy adult tissue profiled by the Roadmap Consortium (Fig 1A, Methods) (Roadmap Epigenomics Consortium et al. 2015). Out of 1315 total repeat families annotated in the human genome, we found 23 families that showed significant enrichment within the accessible chromatin in at least one cancer type (Fig 1B), of which 19 correspond to long terminal repeat (LTR) regions of primate-specific ERVs (Supp Table 1). These observations from chromatin accessibility data generated from primary tumors confirm previous reports of LTR-derived regulatory activity in cancer cell lines (Jacques, Jeyakani, and Bourque 2013;Wang et al. 2007;Deniz et al. 2020), and support a role for ERVs in shaping patient tumor epigenomes.

LTR10 elements exhibit cancer-specific regulatory activity
To further investigate the cancer-specific regulatory activity of ERVs, we focused on LTR10 elements, which were enriched within cancer-specific accessible chromatin for several types of epithelial tumors including colorectal, stomach, prostate, and lung tumors (Fig 1C,Supp Fig S1A). LTR10 elements (including LTR10A-G, N = 2331) are derived from the LTR of the gammaretrovirus HERV-I, which integrated into the anthropoid genome 30 million years ago (Fig 1D, 1E) (Wang et al. 2007). As our initial TCGA analysis was conducted using aggregate data for each tumor type, we first confirmed that LTR10 elements showed recurrent chromatin accessibility across colorectal tumors from multiple individual patients ( Fig 1F, Supp fig S1). We then analyzed epigenomic datasets from the HCT116 colorectal cancer cell line (Akhtar-Zaidi et al. 2012;Baranello et al. 2016;Cohen et al. 2017;Zheng et al. 2019) and found that LTR10A and LTR10F elements exhibit canonical chromatin hallmarks of enhancer activity, including enrichment of histone modifications H3K27ac and H3K4me1, the transcriptional coactivator p300, and RNA Polymerase II occupancy ( Fig   1E). We did not observe enhancer-like profiles at LTR10C elements, which have previously been identified as a source of p53 binding sites (Grow et al. 2021;Wang et al. 2007) (Supp Fig S1C). While most LTR10A and LTR10F elements are not transcribed, some show evidence of transcription as promoters for full-length non-coding HERV-I insertions or cellular transcripts (Supp Fig S1D). Therefore, elements derived from the LTR10A and LTR10F subfamilies (hereafter referred to as LTR10 elements) show robust epigenomic signatures associated with enhancer activity in colorectal cancer cells.
We next compared the epigenetic status of LTR10 elements between colorectal cancer cells and normal tissues. In multiple patient-matched epigenomic datasets (Orouji et al. 2021;Bujold et al. 2016), LTR10 elements show globally increased levels of enhancer-associated histone modifications H3K27ac and H3K4me1 in tumor samples compared to adjacent normal colorectal tissues (Fig 2A, 2B). In additional datasets generated from healthy colorectal tissue samples (Cohen et al. 2017;Bernstein et al. 2010;ENCODE Project Consortium 2012;Lister et al. 2009), LTR10 elements do not show any evidence of enhancer activity (Supp Fig S2A). Although we did not examine rare or transient normal cell populations, these observations suggest that LTR10 regulatory activity is generally restricted to tumor cells.
We expanded our analysis to include epigenomic states from all adult tissues (Roadmap Epigenomics Consortium et al. 2015). We found no evidence for LTR10 enhancer activity in normal tissues, but instead observed general enrichment of H3K9me3-associated heterochromatin marks (Fig 2C, Supp Figure S2B, Supp Table 2). To identify factors that directly bind to and potentially repress LTR10 elements, we analyzed the Cistrome database (Zheng et al. 2019) of published human ChIP-Seq datasets to identify transcriptional repressors with evidence for enriched binding within LTR10 elements.
Considering all cell types, we found that LTR10 elements are significantly enriched for binding by ZNF562, TRIM28, and SETDB1 (Fig 2D, 2E, Supp Table 3), which are components of the KRAB-ZNF transposon silencing pathway (Imbeault, Helleboid, and Trono 2017). Our analysis suggests that, as expected for most primate-specific TEs (Bruno, Mahgoub, and Macfarlan 2019), LTR10 elements are normally subject to H3K9me3-mediated epigenetic silencing in somatic tissues.

LTR10-derived enhancers are regulated by AP1
To identify which pathways are responsible for cancer-specific reactivation of LTR10 elements, we focused our Cistrome enrichment analysis on activating transcription factors in colorectal cancer cell lines. LTR10 elements were significantly enriched for binding by AP1 complex members (Fig 2F,2G,Supp Table 3) including the JUND, FOSL1, and ATF3 transcription factors. The LTR10A and LTR10F consensus sequences harbor multiple predicted AP1 binding motifs, which are enriched within LTR10 elements marked by H3K27ac in HCT116 cells ( Fig 2H). Moreover, the AP1 motifs are largely absent in other LTR10 subfamilies (Fig 2I). These analyses indicate that the cancer-specific enhancer activity of LTR10 elements is likely driven by sequence-specific recruitment of the AP1 complex.
Dysregulation of AP1 signaling occurs in many cancers, driven by mutations that cause oncogenic activation of the MAPK signaling pathway (Wagner and Nebreda 2009). We first used luciferase reporter assays to test whether LTR10 enhancer activity is affected by modulation of the MAPK/AP1 signaling pathway. We synthesized the LTR10A and LTR10F consensus sequences as well as variants where the AP1 motifs were disrupted, and cloned the sequences into an enhancer reporter construct. We measured reporter activity in HCT116 cells that were treated for 24 hrs with either TNFa to stimulate signaling or cobimetinib (a MEK1 inhibitor) to inhibit signaling. Consistent with regulation by AP1, cobimetinib treatment caused a decrease in LTR10-driven reporter activity, and TNFa caused an increase ( Fig 3A). Overall regulatory activity was greatly reduced in sequences where the AP1 motif was disrupted ( Fig 3A). These results show that LTR10 enhancer activity can be directly regulated by modulation of the MAPK/AP1 signaling pathway in cancer cells.
To test the role of the AP1 complex in endogenous LTR10 regulation, we silenced the AP1 component FOSL1 using CRISPRi to determine the impact on LTR10 transcriptional activity. Using HCT116 cells expressing dCas9-KRAB-MeCP2 (Yeo et al. 2018), we transfected a guide RNA (gRNA) targeting the FOSL1 transcription start site (TSS) and used RNA-seq to compare gene and TE expression to control cells transfected with a non-targeting gRNA. We first confirmed silencing of FOSL1 (Supp Fig S3A, S3B), then analyzed TE transcript expression summarized at the family level to account for reads mapping to multiple insertions of the same TE (Jin et al. 2015). This analysis revealed that full-length LTR10/HERV-I elements were significantly downregulated upon silencing FOSL1 (Fig 3B), supporting a direct role for the AP1 complex in regulating LTR10 activity.
Next, we investigated how endogenous LTR10 elements respond to modulation of MAPK/AP1 signaling at the chromatin level. We treated HCT116 cells with either cobimetinib or TNFa for 24 hrs and profiled each response using RNA-seq and H3K27ac CUT&RUN (Supp Fig S3C, S3D, S3E, S3F). Consistent with our reporter assay results, our RNA-seq analysis showed that full-length LTR10/HERV-I transcripts were significantly downregulated upon cobimetinib treatment, and upregulated upon TNFa treatment ( Fig   3B). LTR10 elements showed similar responses based on H3K27ac CUT&RUN signal, exhibiting significant enrichment within the genome-wide set of predicted enhancers downregulated by cobimetinib or upregulated by TNFa (Fig 3C, 3D). We also observed clear TNFa-induced H3K27ac signal over LTR10 elements in a published dataset of SW480 colorectal cancer cells (Rahnamoun et al. 2017) (Supp Fig S3G). These results indicate that LTR10 elements represent a significant subset of genome-wide enhancers and transcripts in HCT116 cells that are directly modulated by MAPK/AP1 signaling.

LTR10 elements regulate cancer-specific pathological gene expression
To determine whether any LTR10-derived enhancers have a functional effect on MAPK-dependent gene expression in colorectal cancer cells, we used our RNA-seq and CUT&RUN data from HCT116 cells to identify elements predicted to have gene regulatory activity. We defined a set of 620 MAPK-regulated genes that were upregulated by TNFa and downregulated by cobimetinib ( Fig 3E, Supp Table 4). We identified LTR10 elements predicted to regulate these genes using the activity by contact model (Fulco et al. 2019) to assign enhancer-gene targets based on H3K27ac signal and chromatin interaction data (Supp Table 5). This yielded 57 LTR10-derived enhancers predicted to regulate 74 MAPK-regulated genes, including many with established roles in cancer pathophysiology ( Fig 3F, 3G, Supp Table 6).
We next used CRISPRi to experimentally silence predicted LTR10-derived enhancers in HCT116 cells. We first focused on a predicted tumor-specific enhancer (LTR10.ATG12) formed by two LTR10F elements on chromosome 5, located 87 kb from predicted target genes ATG12 and AP3S1 ( Fig 4A). The LTR10.ATG12 element shows characteristic epigenomic signatures of tumor-specific enhancer activity and binding by AP1 transcription factors. To silence the element, we stably transfected cells with a pool of four gRNAs targeting unique sequences within the element. As a positive control, we separately transfected cells with a single guide RNA targeting the ATG12 TSS. We determined the genome-wide transcriptional effects of CRISPRi silencing by using RNA-seq to compare gene expression against cells transfected with a non-targeting gRNA.
We verified that silencing ATG12 resulted in the expected specific downregulation of ATG12 (Supp Fig S4A, S4B). In cells where the LTR10.ATG12 element was silenced, the 8 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint most significantly downregulated genes were ATG12 and AP3S1 (Fig 4B, 4C, Supp Table   7), confirming the predicted enhancer function of the element. In addition, we observed downregulation of other nearby genes including ARL14EPL (Fig 4B, 4C), consistent with enhancer activity that affects multiple genes in the locus. Genome-wide, we observed differential regulation of other genes, possibly due to indirect effects from target gene knockdown or off-target silencing of other LTR10 elements (Supp Fig 4C). Notably, we observed that multiple genes regulated by LTR10.ATG12 showed similar patterns of transcriptional downregulation in response to FOSL silencing and cobimetinib treatment ( Fig 4D). These results indicate that LTR10.ATG12 acts as an enhancer that controls AP1-dependent transcriptional activation of multiple genes in the ATG12/AP3S1 locus in HCT116 cells.
The ATG12 gene encodes a ubiquitin-like modifier required for macroautophagy as well as mitochondrial homeostasis and apoptosis (Haller et al. 2014;Rubinstein et al. 2011;Radoshevich et al. 2010;Hanada et al. 2007). Expression of ATG12 is associated with tumorigenesis and therapy resistance in colorectal and gastric cancer (Hu et al. 2018;YiRen et al. 2017), but the mechanism of cancer-specific regulation of ATG12 has not been characterized. Therefore, we aimed to determine whether the LTR10.ATG12 enhancer was responsible for regulating ATG12 expression and activity in HCT116 cells.
First, we validated that silencing the enhancer resulted in decreased ATG12 protein levels by immunoblotting ( Fig 4E). In cells where either ATG12 or the enhancer was silenced, there was a clear reduction in protein levels of both free ATG12 and the ATG3-ATG12 conjugate. There was minimal knockdown effect on the levels of the ATG5-ATG12 conjugate, which has previously been observed in ATG12 silencing experiments and is due to the high stability of the ATG5-ATG12 complex (Haller et al. 2014).
Next, we tested whether ATG12-dependent functions require the activity of the LTR10.ATG12 enhancer. We treated each cell line with staurosporine (STS) to trigger mitochondrial apoptosis, which is dependent on free ATG12 binding to Bcl-2 (Rubinstein et al. 2011). In cells where either ATG12 or the enhancer was silenced, we observed significantly reduced caspase 3/7 activity, indicating defective mitochondrial apoptosis ( Fig   4F). We did not detect differences in macroautophagy in cells treated with bafilomycin, consistent with the lack of knockdown of the ATG5-ATG12 conjugate (Hanada et al. 2007).
Our experimental results from silencing both ATG12 and the enhancer are concordant with previous studies directly silencing ATG12 using siRNAs in other cancer cell lines (Rubinstein et al. 2011;Haller et al. 2014). Together, these experiments demonstrate that the LTR10.ATG12 enhancer is functionally important for ATG12-dependent activity in HCT116 cells.
Having established the regulatory activity of LTR10.ATG12, we selected 5 additional LTR10 elements for characterization based on their proximity to predicted target genes with established cancer relevance ( Fig 5). We separately silenced each LTR10 element using CRISPRi and selected one element (LTR10.KDM6A) to delete using CRISPR/Cas9, due to its intronic location (Supp Fig S5). We used RNA-seq to determine the transcriptional consequences of perturbing each element. As we observed with LTR10.ATG12, we observed local downregulation of multiple genes within 1.5 MB of each targeted element, confirming their activity as functional enhancers in HCT116 cells. These

LTR10 elements contain highly mutable VNTRs
Lastly, we investigated variation at LTR10 elements across 15,708 human genomes profiled by the Genome Aggregation database (gnomAD) (Karczewski et al. 2020). All 10 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint LTR10 insertions are fixed, but we observed an unexpected enrichment of >10bp indel structural variants affecting the AP1 motif region specific to LTR10A and LTR10F, but not other LTR10 subfamilies such as LTR10C (Fig 6A). Further sequence inspection revealed that LTR10A and LTR10F elements contain an internal variable number of tandem repeats (VNTR) region, composed of a 28-30 bp sequence that includes the AP1 motif ( Fig 6B).
Individual elements annotated in the reference genome show extensive variation in tandem repeat length, with up to 33 copies of the AP1 motif (Supp Fig S6A). The number of motifs strongly correlated with H3K27ac and FOSL1 binding activity in HCT116 cells (Supp Fig S6A), suggesting that tandem repeat length affects AP1-dependent regulation of individual elements. Across the human population, LTR10A and LTR10F elements harbor many rare and common indel structural variants of lengths that follow a 28-30 bp periodicity, and this pattern is absent in LTR10C elements which lack the tandem repeat region (Fig 6C, 6D). These elevated levels of polymorphism across copies and individuals are characteristic of unstable tandem repeat regions (Gymrek 2017), and suggest that LTR10 VNTR regions may be a common source of genomic regulatory variation.
Accurately genotyping tandem repeat length polymorphisms remains a major challenge using short-read data, therefore we validated the presence of LTR10 VNTR polymorphisms using long-read structural variant calls generated from 15 individuals (Audano et al. 2019). We recovered indel structural variants within 24 individual LTR10A and LTR10F elements, which also showed 28-30 bp periodicity ( Fig 6D, Supp Fig S6B).
We confirmed the presence of additional LTR10 VNTR indels using a separate long-read dataset from 25 Asian individuals (Fig 6E, Supp Fig S6C) (Quan et al. 2021). At the LTR10.ATG12 locus, we observed multiple indels supported by both short-read and long-read data that are predicted to affect AP1 motif copy number (Fig 6E). At a genome-wide level, LTR10 elements were a significantly enriched source of long-read indels, despite being fixed in the population (Fig 6F). Finally, in addition to finding extensive germline genetic variation affecting LTR10 elements, we also found evidence for tumor-specific somatic mutations within LTR10 elements, based on tumor-specific variants called by the International Cancer Genome Consortium (Supp Fig S6D) (ICGC/TCGA 11 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10. 1101 Pan-Cancer Analysis of Whole Genomes Consortium 2020). Together, these analyses reveal that LTR10 VNTR regions are a source of non-coding polymorphisms, which potentially contribute to both germline and somatic regulatory variation in cancer cells. Our study used CRISPR to functionally validate multiple LTR10 elements with predicted enhancer activity, revealing their involvement in regulating genes with cancer relevance.

DISCUSSION
As a case example, we demonstrated that a LTR10 element was required for ATG12-dependent mitochondrial apoptosis in HCT116 colorectal cancer cells. ATG12 expression has been implicated in tumorigenesis and therapy resistance in several cancer types, making it a candidate for therapeutic targeting (Rubinsztein, Codogno, and Levine 2012;Hu et al. 2018). Notably, MAPK inhibitor treatment results in silencing LTR10 regulatory activity, suggesting that ERV enhancer inactivation may be an underlying mechanism contributing to therapeutic MAPK inhibition.
12 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10. 1101 Finally, we discovered that LTR10 elements are frequently affected by indels that could influence their regulatory activity. Although all LTR10 insertions are fixed in the human population, they contain internal tandem repeats that show high levels of length polymorphism associated with repeat instability. Germline or somatic variation in AP1 motif copy number within these elements may alter cancer-specific enhancer landscapes, and we speculate that LTR10 VNTRs may be subject to further mutation in cancer cells with microsatellite instability (Hung et al. 2019). Our study of LTR10 highlights how TEs that are normally silenced can become reactivated in cancer and cause aberrant gene expression.
For elements that promote pathogenesis, their restricted activity in age-associated diseases like cancer may result in reduced or nearly neutral fitness consequences.
Therefore, the accumulation of TEs subject to epigenetic silencing may be a fundamental process shaping the evolution of cancer-specific gene regulatory networks.

Cell culture
The HCT116 cell line was purchased from ATCC and cultured in McCoy's 5A medium supplemented with 10% FBS and 1% penicillin/streptomycin (Gibco). Cells were cultured at 37°C in 5% carbon dioxide. Transfections were performed using FuGENE (Promega).
For treatments modulating MAPK signaling, HCT116 cells were untreated or treated for 24 hrs with 1 uM Cobimetinib, 100 ng/uL TNF alpha, or DMSO.

CRISPR-mediated silencing and knockout of LTR10s
For CRISPR-mediated silencing (e.g., CRISPRi) of select LTR10 elements and gene transcription start sites (TSS), a HCT116 dCas9-KRAB-MeCP2 stable line was first For CRISPR-mediated knockout of LTR10.KDM6a, 2 gRNAs (1 specific to each flank of the element) were identified and synthesized as sgRNAs by IDT. Using IDT's AltR technology, RNP complexes were generated in vitro, and electroporated into HCT116 cells 14 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint using the Neon system (ThermoFisher Scientific). Clonal lines were isolated using the array dilution method in a 96-well plate format, and single clones were identified and screened for homozygous deletions by PCR using both flanking and internal primer pairs at the expected deletion site.

Luciferase assay
Reporter assays were conducted in HCT116 cells using the secreted NanoLuc enhancer activity reporter pNL3.3 (Promega) and normalized against a constitutively active firefly luciferase reporter vector, pGL4.50 (Promega). LTR10 consensus sequences for 15 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made Barplots are presented as mean +/-s.d. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
MACS2 was also run in single-end mode with additional parameters --shift -75 and 17 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ;https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint --extsize 150, andBedtools (v2.28.0) was used to merge peaks across the two modes of peak calling for each sample (bedtools merge with options -c 5 -o max).
RNA-seq and PRO-seq reads were aligned to hg38 using hisat2 (v2.1.0) with option --no-softclip and filtered for uniquely mapping reads with samtools for MAPQ > 10. Bigwig tracks were generated using the bamCoverage function of deepTools (v3.0.1), with CPM normalisation (ignoring chrX and chrM) and bin size 1bp. Some datasets from TCGA, ENCODE, Cistrome DB and the CEMT Canadian Epigenomes Project were downloaded as post-processed peaks and bigwig files.

TE colocalization analysis
To determine TE family enrichment within regulatory regions, we used GIGGLE (v0.6.3) to generate a genomic interval index of all TE families in the hg38 human genome, based on Dfam v2.0 repeat annotation (N=1315 TE families). Regulatory regions (e.g., ATAC, ChIP-Seq, or CUT&RUN peaks) were queried against the TE interval index using the giggle search function (-g 3209286105 -s). Results were ranked by Giggle enrichment score, which is a composite of the product of −log10(P value) and log2(odds ratio).
Significantly enriched TE families were defined as those with at least 25 overlaps between TE copies and a set of peak regions, odds ratio over 10, and Giggle score over 100 in at least one cancer type.

Defining cancer-specific regulatory elements
To define cancer-specific regulatory elements, we obtained aggregate ATAC-seq regions associated with each tumor type profiled by The Cancer Genome Atlas (Weinstein et al. 2013), which represent a union of recurrent ATAC-seq regions associated with each tumor type. To filter predicted regulatory regions in healthy adult tissues, we used chromHMM regulatory regions defined by the Roadmap project, using healthy adult tissues from categories 1_TssA, 6_EnhG & 7_Enh (e.g. filtering out placental tissues, embryonic stem cells and trophoblast stem cells due to high numbers of active TEs). Cancer-specific 18 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint regulatory regions were defined using the subtract function of bedtools (option -A) to subtract Roadmap regulatory regions from each cancer peak set.

Transcription factor motif analyses
Motif analysis of LTR10 elements was performed using MEME suite (v5.4.1) in differential enrichment mode (Bailey et al. 2009). HCT116 CUT&RUN H3K27ac-marked LTR10A/F sequences (N=144) were used as input against a background set of unmarked LTR10A/F sequences (N=561), with default settings other than the number of motif repetitions (Any) and the number of motifs to find (5). Each discovered motif was searched for similarity to known motifs using the JASPAR 2018 non-redundant DNA database with TomTom (v5.4.1). FIMO (v5.1.0) was then used to extract motif frequency and hg38 genomic coordinates, with p-value threshold set to 1e-4.

Differential analysis using DESeq2
For RNA-seq samples, gene count tables were generated using featureCounts from the subread (v1.6.2) package with the GENCODE v34 annotation gtf to estimate counts at the gene level, over each exon (including -p to count fragments instead of reads for paired-end reads; -O to assign reads to their overlapping meta-features; -s 2 to specify reverse-strandedness; -t exon to specify the feature type; -g gene_id to specify the attribute type).
To quantify TE expression at the family level, RNA-seq samples were first re-aligned to hg38 using hisat2 with -k 100 to allow multi-mapping reads and --no-softclip to disable soft-clipping of reads. TEtranscripts (v2.1.4) was then used in multi-mapping mode with the GENCODE v34 annotation gtf and hg38 GENCODE TE gtf to assign count values to both genes and TE elements.
For H3K27ac CUT&RUN samples, the bedtools multicov was used to generate a count table of the number of aligned reads that overlap MACS2-defined peak regions. The top 19 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint 20,000 peaks were extracted from each sample and merged (using bedtools merge with -d 100) to produce the peak file used as input to bedtools multicov.
The same DEseq2 analyses were used to identify differentially enriched peak regions between H3K27ac CUT&RUN samples (e.g. in response to MAPK treatment). Significantly differentially enriched regions were queried against the Giggle index of human repeats to identify over-represented TE families.
Prediction of LTR10 enhancer gene targets LTR10 elements were initially prioritized for CRISPR silencing or deletion based on enhancer predictions from the Activity-by-Contact (ABC) model (Fulco et al. 2019). Publicly available HCT116 ATACseq (GEO accession GSM3593802) and in-house HCT116 H3K27ac CUT&RUN were used as input to the ABC pipeline, as well as the provided averaged human cell line HiC file. Predicted enhancer regions with an ABC interaction score over 0.001 were intersected with H3K27ac-marked LTR10A/F elements. Putative LTR10 enhancers were then checked for proximity (e.g. within 1.5Mb) to MAPK-regulated genes (i.e. genes that were significantly affected by MAPK treatments Cobimetinib and TNF-alpha, based on inhouse RNAseq).

Evolutionary analysis of LTR10 sequences
Genomic coordinates of LTR10 elements in the hg38 human genome were obtained from Dfam v2.0, based on RepeatMasker v4.0.6 repeat annotation. The nucleotide sequence of each LTR10 element was extracted using the getfasta function from bedtools (using -name+ to include coordinates in the header and -s for strand specificity). VSEARCH v2.14.1 was used to set a minimum length threshold of 200bp for LTR10 sequences (--sortbylength -minseqlength 200), prior to alignment. MUSCLE (v3.8.1551) was used to align the remaining sequences. Jalview (v2.11.1.4) was used to perform a principal 20 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 component analysis on pairwise similarity scores derived from the multiple sequence alignment.
LTR10 consensus sequences representing each LTR10 subfamily (A-G) were likewise downloaded from Dfam. Sequences were concatenated into one FASTA file and aligned using MUSCLE. FastTree was used to infer a maximum likelihood phylogeny representing the LTR10 subfamily relationships.
The phylogeny of known primate relationships was obtained from TimeTree (Kumar et al. 2017) and the HERV-I insertion estimate was confirmed based on the presence or absence of LTR10 sequences across mammals, using BLAST (v2.7.1+) (Altschul et al. 1990).
The remaining indels were then subsetted by size to retain insertions or deletions between 10 to 300bp in length, and chromosome vcfs were concatenated into one whole genome bed file. Bedtools (v2.28.0) was used to intersect the indel bed file with LTR10 elements from each subfamily, based on Dfam (v2.0) repeat annotation. Indels from long-read datasets were likewise filtered by variant type (INS or DEL) and length (50-300bp), then intersected with LTR10 elements. Deletion length versus allele frequency was plotted for each subfamily, from each separate dataset.

DATA AVAILABILITY
Inhouse high-throughput sequencing data has been deposited in the Gene Expression  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021  . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.28.466196 doi: bioRxiv preprint ChIP-Seq, gnomAD indels between 10-300bp in length, and AP1 motif matches (p<1e-4) across LTR10A, LTR10F, and LTR10C elements. Overlapping elements were removed, retaining 990 LTR10 elements total across the three subfamilies. showing AP1 motifs, long-read indels (58bp deletion reported by Quan et al., 2021), and gnomAD indels. (F) GIGGLE enrichment of ERVs within long-read indels.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 30, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021